**********************************************************************
* TextInfo 1.3 by Erik Spåre (Parsec/Phuture 303) 990107 *
* Filematching routines by Anders Vedmar (Axehandle) *
**********************************************************************
IS THIS A PROGRAM FOR YOU?
TextInfo's task is to count all the various _unique_ words in texts
and list them. "Mine your mine" = 3 words, 2 unique (mine, your).
Read on. If you find the following facts interesting, then this may
be a program for you.
! Two Cities by Dickens has almost twice as many unique words
as the Koran, despite the fact that the Koran is bigger (Two
Cities 8041 unique/138384 total, Koran 4300/152164). Plato's
Republic (a translation, like the Koran) has 45 % more (6199/127005).
! Moby Dick has 13403/213486 words; that is 27 % more than
the Bible's 10560/812394, even though Moby Dick's size is only
about one quarter of the Bible's!
! A friend (greetings Théonore) told me that he had heard that
there was a word in the Bible that occurred 666 times, and it
was the name of the Beast. There is no such word in the King
James Bible... :(
! You only need to know the meaning of 48 words to understand
half of the words written in the Bible... (37 in the Koran,
44 in the Republic, 64 in Two Cities and 87 in Moby Dick).
It is indeed as Axehandle said: "One would have supposed that Allah
had a bigger vocabulary..."
TEXTINFO IS...
100 % Assembler (d'oh!)
Freeware
REQUIREMENTS
OS 2.04+ and a mind that is more interested in how many grains of
sand the average beach contains than in the latest sports results
or soap operas.
INTRODUCTION
"`Don't be afraid to hear me. Don't shrink from anything I say. I
am like one who died young. All my life might have been.'
`Is it not--forgive me; I have begun the question on my lips--a
pity to live no better life?'
`God knows it is a shame!'"
/ Charles Dickens, Two Cities.
"Here is wisdom. Let him that hath understanding count the number
of the beast: for it is the number of a man; and his number is Six
hundred threescore and six."
/ The Bible.
"O ye who believe! Approach not coding while ye are drunk, until
ye well know what ye type."
/ The Koran.
The idea for this program came to me one day when I was reading
the Koran. It was so amazingly boring that, naturally, my thoughts
were not occupied with the text (which some low-priority subroutine
of my brain dutifully supplied), but with the things I could do
after I had finished reading the current chapter. Sometimes I
discussed with myself how bored I was.
"Have I ever been this bored?"
"I don't know."
"How is it possible to constantly reiterate similar phrases, and
call the result wisdom?"
"He was probably drinking wine -- laudanum most likely -- and got
a bit excited."
And one day the idea for this program was born.
"I wonder how many different words this book contains."
"Not many, I'm sure."
"No... I wonder how many..."
"Maybe you could count them with a clever program?"
The program proved me right, needless to say; but it also made
me interested in other things about etexts, like how many words
are responsible for a certain percentage of the word total...
At first, the idea of releasing this program seemed a bit
absurd -- why would anyone use it? -- but now... I don't think
that Axehandle and I are the only ones who find things like this
interesting.
WHAT DOES THE PROGRAM DO?
It goes through textfiles, counts the bytes, letters, words, unique
words and more. Here is a sample of a list that was produced from
all the textfiles (not exceeding 5MB) on the Project Gutenberg CD
(Nov 94).
Bytes read       91502237      Total amount of bytes read
Letters          61101533      Total amount of letters
Words            13925117      Total amount of words
Unique              94816      Number of unique words
Subsumes            19681      Words with word-stem + ending
Syllabications       7210      Executed syllabications
Truncated chars       294      Chars exceeding max wordlength
Truncated words        27      Words that were too long
Examined file/s       204      Files that matched
the                791561      After the above info, the list
and                496698      of words follows. Unless
of                 440919      specified otherwise, all the
...                            unique words are listed.
zygote                  1
zyuganov                1
aarons(1)=aaron                Following the wordlist is the
abhorred(79)=abhorr            subsume report. The number in
abominations(246)=abomination  parentheses is how many times
...                            the word with ending occurred
zulus(1)=zulu                  before it was subsumed to its
zugs(1)=zug                    word-stem.
 5 %        1                  Finally there's the percentage
10 %        3                  list. The third line here
15 %        5                  means that you only need to
...                            know 5 words to understand
90 %     4045                  15 % of all the words that
95 %     8606                  were examined.
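For the curious, here is roughly what that bookkeeping amounts to,
sketched in Python (the real program is 100 % assembler, and the
filename below is made up):

  from collections import Counter
  import re

  def text_stats(data):
      # runs of letters are words; everything else separates them
      words = re.findall(r"[a-z]+", data.lower())
      counts = Counter(words)
      return {"Bytes read": len(data),
              "Letters": sum(len(w) for w in words),
              "Words": len(words),
              "Unique": len(counts)}, counts

  # latin-1 keeps one char per byte, so len(data) equals bytes read
  stats, counts = text_stats(open("alice.txt", encoding="latin-1").read())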
WHAT IS IT USEFUL FOR?
Apart from giving you swarms of interesting facts, it is not very
useful. It will tell you how big your vocabulary is (try comparing
the unique words found (with subsuming disabled) in your letters
written in your native language versus English).
If words like "fuck", "lame" or "cool" top your letters, then
maybe you should consider blushing...
If you are curious about a certain etext, check the first noun in
the list; it will in many cases tell you the whole plot. I have
tried this on five etexts...
NAME                   MOST COMMON NOUN
Alice in Wonderland    alice
Hacker's Crackdown     computer
Moby Dick              whale
The Bible              lord
The Koran              god
If you have lots of texts in a foreign language that you wish to
study, it could be a good idea to "start from the top"... Let's
say that you didn't understand a word of English, that your
favourite author was Lewis Carroll, and that you would love to read
"Alice in Wonderland" as it was originally written. Alice in
Wonderland is 150 kb, and 1083 words = 95 % of it; if the Gutenberg
results represent the English language (where 95 % = 8606 words)
you would "spare" yourself more than 7000 words with this method.
Hmm...
THE ARGUMENT LINE
All options are case sensitive.
Usage: TextInfo [-<OPTIONS>] <FILE/PATH> <DESTFILE> [ALL]
<FILE/PATH> is either a file or a file pattern.
<DESTFILE> is the name of the output file.
ALL is to be used when you want recursive matching.
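For example, to examine every file on a (hypothetical) Texts:
volume recursively, with syllabication disabled, writing the result
to RAM: (#? being the AmigaDOS wildcard):

  TextInfo -s "Texts:#?" RAM:result.txt ALL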
Options
-s Disable syllabication
This will disable syllabication. By default syllabication
is used, meaning that words that do not fit on one line,
and thus are separated by a hyphen, will be connected and
treated as one word. For instance "norr-[NEW LINE]sken"
will be listed as "norrsken" when syllabication is on;
otherwise it becomes 2 separate words. Same thing with
"norr-[NEW LINE]-sken" or even "norr-[NEW LINE] sken".
The only thing the syllabication routine requires is a
letter followed by a hyphen and a new line (CR and/or LF);
it will then connect the first found letter or letters.
There is one thing that will abort the syllabication, and
that is when another hyphen is found within the word that
is to be connected. E.g. "bread-[NEW LINE]and-butter" will
be treated as three words. This is not always good...
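Expressed in Python, the behaviour described above could be
modelled roughly like this (a sketch of the rule, not the actual
assembler routine):

  import re

  # letter, hyphen, newline (CR and/or LF), optionally one more
  # hyphen or a space, then the continuing letters -- but only if
  # no further hyphen follows (that aborts the join, as described)
  _HYPH = re.compile(r"([A-Za-z])-(?:\r\n|\r|\n)[- ]?([A-Za-z]+)\b(?!-)")

  def desyllabicate(text):
      # "norr-\nsken", "norr-\n-sken", "norr-\n sken" -> "norrsken";
      # "bread-\nand-butter" is left alone and becomes three words
      return _HYPH.sub(r"\1\2", text)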
-e Disable subsuming
Use this if you want subsuming to be disabled. Subsuming
is by default conducted on all the words that end on `s',
`ed' or `ing', if, and only if, the stem-word is found.
In "I have walked there, now I walk here" the word
`walked' ends on `ed', and since its word-stem `walk' also
is present, the word is subsumed to `walk'. But in "I
like stars" `stars' is not subsumed to `star' since the
word-stem is not present. The word-stem has to be at
least three chars, so `his' in "I said hi to... what's
his name..." won't be subsumed to `hi'. Subsuming is not
always correct; it would take a dictionary to make it
safe. For instance `cared' in "My boyfriend really cared
for me in his car" will incorrectly be subsumed to `car'...
That's why there's a subsume report.
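In Python terms, the default subsuming might look something like
this (a sketch of the described behaviour; the names are mine, not
the program's):

  from collections import Counter

  def subsume(counts, endings=("s", "ed", "ing")):
      report = []
      for word in sorted(counts):          # snapshot; safe to modify
          for end in endings:
              stem = word[:-len(end)]
              # fold the word into its stem only if the ending fits,
              # the stem is at least three chars, and the stem exists
              if word.endswith(end) and len(stem) >= 3 and stem in counts:
                  n = counts.pop(word)
                  counts[stem] += n
                  report.append(f"{word}({n})={stem}")
                  break
      return report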
-e<e1>,... Set endings to subsume
This works as the above option (in fact, it is the
same): it disables the default subsuming, but conducts
subsuming on words with the endings that you specify. For
instance "-eed,s,er" will perform subsuming on words
ending with ed, s or er. There are 210 bytes reserved for
the endings; after that comes the percentage text (in memory)...
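In terms of the sketch above, "-eed,s,er" would correspond to
calling subsume(counts, endings=("ed", "s", "er")).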
-r Disable subsume report
Use this option if you do not want a subsume report in the
output file.
-p Disable percentage list
This option will disable the percentage listing in the
output file. By default the number of words that make up
5, 10, 15... up to 95 percent of the total sum of words is
listed at the end. I haven't checked this routine very
much, so I am not 100 % sure that all the values are
correct.
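The idea, sketched in Python (my reconstruction of the described
output, not the actual routine):

  def percentage_list(counts, total):
      # how many of the most frequent words cover 5 %, 10 %... 95 %
      # of all words? total is the total number of words counted.
      freqs = sorted(counts.values(), reverse=True)
      covered, i, result = 0, 0, []
      for pct in range(5, 100, 5):
          while covered * 100 < pct * total:
              covered += freqs[i]
              i += 1
          result.append((pct, i))   # i words cover pct % of the text
      return result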
-n[<n>] Set number of words to list
Use this option to set the number of words to list in
the output file. This does not affect the subsume report.
Note: -n alone will disable the word list.
-l<n> Set lowest number to list.
The number n is required and tells the program to list
only words that occurred n times or more. If you write
"-l10000", only words that occurred ten thousand times
or more will be listed.
-m[<kb>] Set minimum filesize
By default this is set to 50 bytes; that means that
files smaller than 50 bytes will not be examined. If you
don't specify a number, all sizes will be valid; otherwise
the minimum size is set to 1000*<kb> bytes.
-M<kb> Set maximum filesize
As the -m option, but here you _must_ specify a kb
value. Files exceeding this size will not be loaded.
-t<t>,<s> Set number of tabs,size
By default the output words are displayed with a
width of 24 chars, or 3 tabs with the size of 8. The
maximum wordlength is derived from these numbers --
maxlen = tabs*tabsize-1 (3*8-1 = 23 chars with the
defaults). Number of tabs and tabsize may not exceed 9999.
-z Don't abort on zero
A textfile shouldn't contain the ASCII value zero (except
maybe as an EOF sign), so TextInfo will by default stop
the examination whenever a zero occurs. Use the -z option
to force the program to process all bytes in all files;
this is useful if you have word-processor files with long
headers (bound to hide a zero somewhere).
RUNNING TEXTINFO
When TextInfo is started, this is what will happen. First the
argument line is checked; if it is invalid the program will exit
with the short information text. If everything is ok the
destination file will now be opened (and immediately closed) just
to make sure this won't fail after an hour of intense counting.
Now all the files that match the given pattern are checked to
determine the maximum filesize; this amount of memory is then
allocated. If there is not enough memory, the program will exit
with an error message.
This is when the real program starts. The first matching file
will be loaded into the already allocated memory. (Initially I
allocated a memory block just as large as the current file,
processed it, and then freed it, but this turned out to make the
memory heavily fragmented, and in the end there was seldom enough
memory for the (often huge) final wordlist and output file).
The file will be pre-examined in two passes. In the first pass
the whole file will be lowercased and all non-letters will be set
to zero. If syllabication is wanted, this will be done here. In
the second pass _all_ words will be counted and at the same time
truncated if they exceed the maximum allowed wordlength. There is
no progress indicator for this, because it doesn't take much time.
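In Python terms the two passes amount to roughly this (a sketch;
the real code works in place on the raw bytes, and evidently also
treats national characters like å and ä as letters, which this
ASCII-only version does not):

  def pre_examine(data, maxlen=23):          # 23 = tabs*tabsize-1
      # pass 1: lowercase a-z, turn every non-letter into a zero byte
      table = bytes((c + 32) if 65 <= c <= 90 else
                    (c if 97 <= c <= 122 else 0)
                    for c in range(256))
      data = data.translate(table)
      # pass 2: collect the words, truncating over-long ones
      return [w[:maxlen] for w in data.split(b"\0") if w]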
Now the file is safe to examine, and the main routine is called.
It will check one word at a time; if the word has occurred before,
its counter will be incremented; if it is a new word, it will be
added and given a counter set to 1. Every 256th word, the progress
indicator will be updated.
When this is done, two numbers will be displayed: the first is
the amount of words the file contained; the second is how many of
them had not been found before. The program then loads the
next matching file.
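Reduced to Python, the main routine does something like this
(illustrative only):

  def count_file(words, counts):
      new = 0
      for n, word in enumerate(words, 1):
          if word in counts:
              counts[word] += 1
          else:
              counts[word] = 1      # a word not seen in any file yet
              new += 1
          if n % 256 == 0:
              pass                  # progress indicator updated here
      return len(words), new        # the two numbers displayed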
When all files have been examined, the memory block used for the
loaded files is freed. The subsuming is now executed, unless it is
disabled. After this the result is rearranged into a large
wordlist. This wordlist is then sorted; all the words that occurred
255 times or less are sorted instantly, the rest are bubble sorted.
If you have, say, 20 thousand words that appeared more than 255
times, this will take some time, but normally you will hardly
notice the sorting.
The words are sorted in order of frequency, with the most common
word (probably `the') at the top. Words that occurred an equal
number of times are sorted by their first 2 characters. Why only
the first two? Because it is just a side-effect of the counting
routine (a nice one for a change!). It does mean that the
listing is in ASCII order, so Swedish texts (for instance) will
unfortunately be listed with the ä-words before the å-ones. Ah
well!
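My guess at the idea, sketched in Python (the exact internals are
not documented here, so take this purely as an illustration of
`instant' bucket sorting plus bubble sorting for the heavy hitters):

  def sort_by_frequency(items):
      # items: (word, count) pairs, already roughly ordered by their
      # first two characters as a side effect of the counting routine
      buckets = [[] for _ in range(256)]
      heavy = []
      for pair in items:
          if pair[1] <= 255:
              buckets[pair[1]].append(pair)   # one pass, no compares
          else:
              heavy.append(pair)
      for i in range(len(heavy)):             # bubble sort, descending
          for j in range(len(heavy) - 1 - i):
              if heavy[j][1] < heavy[j + 1][1]:
                  heavy[j], heavy[j + 1] = heavy[j + 1], heavy[j]
      return heavy + [p for n in range(255, 0, -1) for p in buckets[n]]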
Finally the output file will be created and written to the
destination.
ABOUT THE CODE...
I do not wish to brag, so let me just say that the main routine for
counting the various words is amazingly fast. (Right Axehandle?)
However, the additional (boring) routines for making everything
foolproof, and the progress display, have slowed it down somewhat.
Still, if you disable syllabication and subsuming, counting and
sorting all the words in the Koran (0.8MB) and creating an output
file takes only 8 seconds on my A1200 (slightly more than half the
time Dopus needed just to count the lines). The Bible (66 files with
a total of about 4.6MB) takes 50 seconds. (My first version of
this program needed 45 minutes to go through the Koran! The second
version was a bit more efficient, and actually 29 THOUSAND percent
faster (before optimization)! A ratio that would make any
programmer drool...)
The subsume routine has not been optimized.
BUGS???
I have never encountered any bugs in this version. However, I have
only been able to check 90 MB of text at a time, so I cannot be
completely sure how it works on, say, 1GB of data. The number
displays can only handle 10 digits, or one unsigned longword.
THE PHUTURE
There will probably not be any updates to this program, unless I
get one single request from some unknown TextInfo user (that's all
the motivation I need!).
It has been suggested to me by Axehandle that there should be an
option to add results to an already existing wordlist. That way it
would be possible to create one huge wordlist formed from many,
many CD-ROMs.
XPK/LZX/LHA/ZIP support would also be very useful, since most
CDs pack their texts.
CONTACT ME...
If you want an update made, or something else, write...
EMail: blodskam@ebox.tninet.se (valid to end of July 1999,
after that use: blodskam@hotmail.com)
I have, btw, used the handle Parsec since the summer of 1991. I know
many people think handles are silly (even though /nicks are "kewl"),
but... It's a silly life! Take it seriously, and *you* are the fool!
HISTORY
v1.0 (960320)
** First public release
v1.1 (971217)
** According to a friend (Thomas Richter I believe) TextInfo
would crash if no end quote was found in the filepattern.
Fixed this.
v1.2 (980112)
** The progress indicator is now adapted to the shell width.
Thanks to Finn Nielsen for giving me a routine that demonstrated
how to do this.
I still haven't got a request for more features, although I
received an email from someone who had at least tried the program.
v1.3 (990107)
** When I wanted to include the output in an email, the mailer
of course couldn't handle the tabs. I tried to circumvent this
by specifying -t32,1, hoping that TextInfo would make the output
32 characters wide... but when it didn't work I vaguely remembered
being too lazy to accept more tabs than 9 (one digit only) and
so I fixed this and made sure that spaces are printed instead
of tabs, if the tabsize is set to 1.
This is all for this release, still no request for more features.
Perhaps the program is perfect now? :)